# Libraries
import pandas as pd
import numpy as np
import janitor
from pandas_profiling import ProfileReport
import plotly.express as px
from helpers import bar_plotter, merger, greenspace_plotter, df_std
# Read in distance to green spaces
distance = pd.read_csv("data/green_spaces.csv").clean_names()
# Read in neighbourhood ratings
neighbourhood = pd.read_csv("data/neighbourhood_rating.csv").clean_names()
# Read in community ratings
community = pd.read_csv("data/community_belonging.csv").clean_names()
distance
| | featurecode | datecode | measurement | units | value | distance_to_nearest_green_or_blue_space | age | gender | urban_rural_classification | simd_quintiles | type_of_tenure | household_type | ethnicity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | S12000026 | 2013 | 95% Lower Confidence Limit, Percent | Percent Of Adults | 71.0 | A 5 minute walk or less | All | All | All | All | All | All | All |
| 1 | S12000045 | 2017 | Percent | Percent Of Adults | 59.0 | A 5 minute walk or less | All | All | All | All | All | Pensioners | All |
| 2 | S12000026 | 2014 | 95% Upper Confidence Limit, Percent | Percent Of Adults | 86.9 | A 5 minute walk or less | All | All | All | All | All | All | All |
| 3 | S12000026 | 2017 | 95% Upper Confidence Limit, Percent | Percent Of Adults | 80.9 | A 5 minute walk or less | All | All | All | All | All | All | All |
| 4 | S12000026 | 2017 | 95% Upper Confidence Limit, Percent | Percent Of Adults | 79.6 | A 5 minute walk or less | All | All | All | All | All | Pensioners | All |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 38446 | S92000003 | 2018 | Percent | Percent Of Adults | 26.0 | Within a 6-10 minute walk | All | All | All | All | All | All | Other |
| 38447 | S92000003 | 2018 | 95% Lower Confidence Limit, Percent | Percent Of Adults | 20.4 | Within a 6-10 minute walk | All | All | All | All | All | All | Other |
| 38448 | S92000003 | 2019 | 95% Upper Confidence Limit, Percent | Percent Of Adults | 7.8 | Don't Know | All | All | All | All | All | All | Other |
| 38449 | S92000003 | 2014 | 95% Lower Confidence Limit, Percent | Percent Of Adults | 15.8 | Within a 6-10 minute walk | All | All | All | All | All | All | Other |
| 38450 | S12000036 | 2018 | 95% Upper Confidence Limit, Percent | Percent Of Adults | 36.1 | Within a 6-10 minute walk | All | All | All | All | All | All | Other |
38451 rows × 13 columns
The datasets contain the percentage of adults classified according to the independent variable (in this case distance_to_nearest_green_or_blue_space), broken down by one other variable at a time while the remaining variables are held constant (signified by the value All).
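To make this layout concrete, here is a minimal sketch of how rows for a single breakdown could be selected. The toy values and the helper `breakdown_rows` are made up for illustration and are not part of the project's helpers.

```python
import pandas as pd

# Toy rows in the same long format; the values are invented for illustration.
toy = pd.DataFrame({
    "value": [66.0, 75.0, 71.0, 68.0],
    "urban_rural_classification": ["Urban", "Rural", "All", "All"],
    "gender": ["All", "All", "Male", "All"],
})

def breakdown_rows(df, column, others):
    """Keep rows broken down by `column` with every column in `others` held at 'All'."""
    mask = df[column] != "All"
    for other in others:
        mask &= df[other] == "All"
    return df[mask]

# Hypothetical usage: only the urban/rural breakdown survives the filter.
by_area = breakdown_rows(toy, "urban_rural_classification", ["gender"])
print(by_area)
```

The row where every breakdown column equals All is the overall figure, so it is excluded from any single breakdown.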
distance.distance_to_nearest_green_or_blue_space.value_counts()
A 5 minute walk or less      10377
Within a 6-10 minute walk    10377
An 11 minute walk or more    10350
Don't Know                    7347
Name: distance_to_nearest_green_or_blue_space, dtype: int64
community.walking_distance_to_nearest_greenspace.value_counts()
All                     39375
Less than 10 minutes     3303
More than 10 minutes      828
Don't Know                105
Name: walking_distance_to_nearest_greenspace, dtype: int64
neighbourhood.neighbourhood_rating.value_counts()
Very good      9564
Fairly good    9564
Fairly poor    8781
Very poor      6828
No opinion     3318
Name: neighbourhood_rating, dtype: int64
community.community_belonging.value_counts()
Very strongly          9564
Fairly strongly        9564
Not very strongly      9564
Not at all strongly    9186
Don't know             5733
Name: community_belonging, dtype: int64
The above are the independent variables. All datasets contain a variable that represents walking distance to a green space, but the first one is binned differently, with useful extra granularity: it records walks of 5 minutes or less and walks of 6-10 minutes separately. The other two datasets record only whether the distance is more or less than 10 minutes. Moreover, some of these records are contradictory (e.g. a percentage of people reporting being both less than and more than 10 minutes away from green spaces). For these reasons, walking_distance_to_nearest_greenspace from community and neighbourhood will be kept before joining the three datasets, then re-binned after the join.
Lastly, regional values will be filtered after the merge, as we are considering Scotland as a whole for the analysis questions, and other variables (e.g. urban or rural) can be assumed to contain at least some of the regional information.
The custom function merger() below joins the three datasets together in preparation for further analysis.
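The implementation of merger() is not shown here. As a rough sketch only, one way to combine long-format tables like these is to stack them and fill any breakdown column a table lacks with All. The function `stack_surveys`, the toy frames and the column list below are assumptions for illustration, not the project's actual code.

```python
import pandas as pd

def stack_surveys(frames, all_columns):
    """Stack long-format tables, filling columns a table lacks with 'All'."""
    aligned = [df.reindex(columns=all_columns, fill_value="All") for df in frames]
    return pd.concat(aligned, ignore_index=True)

# Toy stand-ins for two of the datasets (values invented for illustration).
distance_toy = pd.DataFrame({
    "value": [66.0],
    "nearest_green_space": ["A 5 minute walk or less"],
})
community_toy = pd.DataFrame({
    "value": [38.0],
    "community_belonging": ["Fairly strongly"],
})

columns = ["value", "nearest_green_space", "community_belonging"]
stacked = stack_surveys([distance_toy, community_toy], columns)
print(stacked)
```

Stacking rather than joining on keys keeps one row per reported percentage, which matches the one-breakdown-at-a-time layout of the inputs.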
survey = merger(distance, community, neighbourhood)
# These are the resulting bins and counts for the new variable
survey.nearest_green_space.value_counts()
All                          24235
Within a 6-10 minute walk     5379
An 11 minute walk or more     3980
A 5 minute walk or less       3460
Don't Know                      61
Name: nearest_green_space, dtype: int64
# Filtering out regional values for analysis
survey = survey.query("featurecode == 'S92000003'").drop('featurecode', axis=1)
survey.sample(10)
| | year | percent_adults | nearest_green_space | age | gender | urban_rural_classification | simd_quintiles | type_of_tenure | household_type | ethnicity | community_belonging | neighbourhood_rating |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 50001 | 2017 | 38.0 | All | All | All | Rural | All | All | All | All | Fairly strongly | All |
| 9299 | 2017 | 64.0 | A 5 minute walk or less | All | All | All | All | All | Adults | All | All | All |
| 213 | 2018 | 12.0 | An 11 minute walk or more | All | All | Urban | All | All | All | All | All | All |
| 33632 | 2014 | 18.0 | Within a 6-10 minute walk | All | All | All | All | All | Adults | All | All | All |
| 107271 | 2018 | 55.0 | All | All | All | All | All | Other | All | All | All | Very good |
| 15027 | 2018 | 3.0 | All | All | All | All | All | Social Rented | All | All | All | All |
| 113325 | 2013 | 9.0 | All | All | All | All | All | Social Rented | All | All | All | Fairly poor |
| 13873 | 2017 | 15.0 | An 11 minute walk or more | All | All | All | All | Other | All | All | All | All |
| 38852 | 2014 | 69.0 | A 5 minute walk or less | All | All | All | All | All | All | White | All | All |
| 32962 | 2016 | 21.0 | Within a 6-10 minute walk | 16-34 years | All | All | All | All | All | All | All | All |
survey.query("nearest_green_space == 'An 11 minute walk or more' & type_of_tenure == 'Owned Outright'").sort_values('year')
| | year | percent_adults | nearest_green_space | age | gender | urban_rural_classification | simd_quintiles | type_of_tenure | household_type | ethnicity | community_belonging | neighbourhood_rating |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10994 | 2013 | 13.0 | An 11 minute walk or more | All | All | All | All | Owned Outright | All | All | All | All |
| 10932 | 2014 | 12.0 | An 11 minute walk or more | All | All | All | All | Owned Outright | All | All | All | All |
| 11531 | 2015 | 13.0 | An 11 minute walk or more | All | All | All | All | Owned Outright | All | All | All | All |
| 12333 | 2016 | 13.0 | An 11 minute walk or more | All | All | All | All | Owned Outright | All | All | All | All |
| 12325 | 2017 | 15.0 | An 11 minute walk or more | All | All | All | All | Owned Outright | All | All | All | All |
| 11849 | 2018 | 14.0 | An 11 minute walk or more | All | All | All | All | Owned Outright | All | All | All | All |
| 12822 | 2019 | 14.0 | An 11 minute walk or more | All | All | All | All | Owned Outright | All | All | All | All |
The above is an example query of the dataframe. It shows, for each recorded year from 2013 to 2019, the percentage of adults in households that are owned outright who live further than a 10 minute walk from the nearest green space. In the analysis questions below, survey will be similarly queried and numeric results will come from the percent_adults column.
ProfileReport(survey)
Here, 'local access' is taken to mean a reported walk of 5 minutes or less to the nearest green space.
(survey.query("nearest_green_space == 'A 5 minute walk or less'")
.sort_values('percent_adults', ascending=False)
).head(20)
| | year | percent_adults | nearest_green_space | age | gender | urban_rural_classification | simd_quintiles | type_of_tenure | household_type | ethnicity | community_belonging | neighbourhood_rating |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 17741 | 2015 | 80.0 | A 5 minute walk or less | All | All | Rural | All | All | All | All | All | All |
| 19850 | 2014 | 79.0 | A 5 minute walk or less | All | All | Rural | All | All | All | All | All | All |
| 16161 | 2018 | 77.0 | A 5 minute walk or less | All | All | Rural | All | All | All | All | All | All |
| 4713 | 2017 | 75.0 | A 5 minute walk or less | All | All | Rural | All | All | All | All | All | All |
| 16127 | 2016 | 75.0 | A 5 minute walk or less | All | All | Rural | All | All | All | All | All | All |
| 16769 | 2013 | 75.0 | A 5 minute walk or less | All | All | Rural | All | All | All | All | All | All |
| 4763 | 2019 | 73.0 | A 5 minute walk or less | All | All | Rural | All | All | All | All | All | All |
| 20451 | 2014 | 72.0 | A 5 minute walk or less | All | All | All | All | Owned Mortgage/Loan | All | All | All | All |
| 8852 | 2014 | 72.0 | A 5 minute walk or less | All | All | All | All | All | With Children | All | All | All |
| 19360 | 2018 | 72.0 | A 5 minute walk or less | All | All | All | All | Owned Mortgage/Loan | All | All | All | All |
| 1904 | 2015 | 72.0 | A 5 minute walk or less | All | All | All | All | All | With Children | All | All | All |
| 20417 | 2015 | 72.0 | A 5 minute walk or less | All | All | All | All | Owned Mortgage/Loan | All | All | All | All |
| 10405 | 2014 | 71.0 | A 5 minute walk or less | 35-64 years | All | All | All | All | All | All | All | All |
| 20507 | 2013 | 71.0 | A 5 minute walk or less | All | All | All | All | Owned Mortgage/Loan | All | All | All | All |
| 8752 | 2014 | 71.0 | A 5 minute walk or less | All | Male | All | All | All | All | All | All | All |
| 9464 | 2018 | 71.0 | A 5 minute walk or less | All | All | All | All | All | With Children | All | All | All |
| 10404 | 2013 | 71.0 | A 5 minute walk or less | 16-34 years | All | All | All | All | All | All | All | All |
| 24180 | 2014 | 70.0 | A 5 minute walk or less | All | All | All | 80% least deprived | All | All | All | All | All |
| 9498 | 2015 | 70.0 | A 5 minute walk or less | 35-64 years | All | All | All | All | All | All | All | All |
| 20380 | 2017 | 70.0 | A 5 minute walk or less | All | All | All | All | Owned Mortgage/Loan | All | All | All | All |
Overall, we see that people from rural households consistently reported short distances, followed by households with a mortgage or with children. This does not show whether the percentage of people in complementary classifications (e.g. urban households, rented households or households with no children) is significantly lower.
df = (survey.query(
"nearest_green_space == 'A 5 minute walk or less' & urban_rural_classification != 'All'")
.sort_values('year')
)
df.head(10)
| | year | percent_adults | nearest_green_space | age | gender | urban_rural_classification | simd_quintiles | type_of_tenure | household_type | ethnicity | community_belonging | neighbourhood_rating |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3181 | 2013 | 66.0 | A 5 minute walk or less | All | All | Urban | All | All | All | All | All | All |
| 16769 | 2013 | 75.0 | A 5 minute walk or less | All | All | Rural | All | All | All | All | All | All |
| 2902 | 2014 | 66.0 | A 5 minute walk or less | All | All | Urban | All | All | All | All | All | All |
| 19850 | 2014 | 79.0 | A 5 minute walk or less | All | All | Rural | All | All | All | All | All | All |
| 3085 | 2015 | 64.0 | A 5 minute walk or less | All | All | Urban | All | All | All | All | All | All |
| 17741 | 2015 | 80.0 | A 5 minute walk or less | All | All | Rural | All | All | All | All | All | All |
| 1131 | 2016 | 63.0 | A 5 minute walk or less | All | All | Urban | All | All | All | All | All | All |
| 16127 | 2016 | 75.0 | A 5 minute walk or less | All | All | Rural | All | All | All | All | All | All |
| 851 | 2017 | 63.0 | A 5 minute walk or less | All | All | Urban | All | All | All | All | All | All |
| 4713 | 2017 | 75.0 | A 5 minute walk or less | All | All | Rural | All | All | All | All | All | All |
df = (survey.query(
"nearest_green_space == 'A 5 minute walk or less' & type_of_tenure != 'All'")
.sort_values('year')
)
df.head(10)
| | year | percent_adults | nearest_green_space | age | gender | urban_rural_classification | simd_quintiles | type_of_tenure | household_type | ethnicity | community_belonging | neighbourhood_rating |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 20507 | 2013 | 71.0 | A 5 minute walk or less | All | All | All | All | Owned Mortgage/Loan | All | All | All | All |
| 10857 | 2013 | 67.0 | A 5 minute walk or less | All | All | All | All | Owned Outright | All | All | All | All |
| 14161 | 2013 | 68.0 | A 5 minute walk or less | All | All | All | All | Private Rented | All | All | All | All |
| 14083 | 2013 | 64.0 | A 5 minute walk or less | All | All | All | All | Social Rented | All | All | All | All |
| 13825 | 2013 | 60.0 | A 5 minute walk or less | All | All | All | All | Other | All | All | All | All |
| 14717 | 2014 | 64.0 | A 5 minute walk or less | All | All | All | All | Social Rented | All | All | All | All |
| 14191 | 2014 | 68.0 | A 5 minute walk or less | All | All | All | All | Private Rented | All | All | All | All |
| 12641 | 2014 | 69.0 | A 5 minute walk or less | All | All | All | All | Owned Outright | All | All | All | All |
| 20451 | 2014 | 72.0 | A 5 minute walk or less | All | All | All | All | Owned Mortgage/Loan | All | All | All | All |
| 13879 | 2014 | 59.0 | A 5 minute walk or less | All | All | All | All | Other | All | All | All | All |
The helper function below creates a plot that shows this difference:
greenspace_plotter(survey, 'urban_rural_classification', 'short')
We can see the effect that re-binning the variable has had by creating the same plot with the raw data:
# Average the raw values per year and urban/rural group
view = (
    distance.query("urban_rural_classification != 'All' & distance_to_nearest_green_or_blue_space == 'A 5 minute walk or less'")
    .groupby(['datecode', 'urban_rural_classification'])
    .mean(numeric_only=True)  # restrict the mean to numeric columns such as 'value'
    .reset_index()
    .sort_values('datecode')
)
fig = px.line(view, x="datecode", y="value", color='urban_rural_classification',
              title='Percentage of adults living a 5 minute walk or less from green spaces', markers=True)
fig.show('notebook')
(survey.query("nearest_green_space == 'An 11 minute walk or more' & community_belonging == 'All' & neighbourhood_rating == 'All'")
.sort_values('percent_adults', ascending=False)
).head(20)
| | year | percent_adults | nearest_green_space | age | gender | urban_rural_classification | simd_quintiles | type_of_tenure | household_type | ethnicity | community_belonging | neighbourhood_rating |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 40817 | 2017 | 22.0 | An 11 minute walk or more | All | All | All | All | All | All | Other | All | All |
| 40802 | 2019 | 21.0 | An 11 minute walk or more | All | All | All | All | All | All | Other | All | All |
| 13947 | 2014 | 21.0 | An 11 minute walk or more | All | All | All | All | Other | All | All | All | All |
| 13836 | 2019 | 20.0 | An 11 minute walk or more | All | All | All | All | Other | All | All | All | All |
| 40966 | 2013 | 18.0 | An 11 minute walk or more | All | All | All | All | All | All | Other | All | All |
| 36486 | 2017 | 18.0 | An 11 minute walk or more | All | All | All | All | All | Pensioners | All | All | All |
| 40820 | 2014 | 18.0 | An 11 minute walk or more | All | All | All | All | All | All | Other | All | All |
| 24021 | 2017 | 18.0 | An 11 minute walk or more | All | All | All | 20% most deprived | All | All | All | All | All |
| 24034 | 2016 | 17.0 | An 11 minute walk or more | All | All | All | 20% most deprived | All | All | All | All | All |
| 27262 | 2019 | 17.0 | An 11 minute walk or more | 65 years and over | All | All | All | All | All | All | All | All |
| 27302 | 2019 | 17.0 | An 11 minute walk or more | All | All | All | All | All | Pensioners | All | All | All |
| 13620 | 2017 | 16.0 | An 11 minute walk or more | All | All | All | All | Social Rented | All | All | All | All |
| 28968 | 2013 | 16.0 | An 11 minute walk or more | All | All | All | All | All | Pensioners | All | All | All |
| 22968 | 2015 | 16.0 | An 11 minute walk or more | All | All | All | All | All | Pensioners | All | All | All |
| 13859 | 2013 | 16.0 | An 11 minute walk or more | All | All | All | All | Other | All | All | All | All |
| 41002 | 2015 | 16.0 | An 11 minute walk or more | All | All | All | All | All | All | Other | All | All |
| 40801 | 2018 | 16.0 | An 11 minute walk or more | All | All | All | All | All | All | Other | All | All |
| 24391 | 2014 | 15.0 | An 11 minute walk or more | All | All | All | 20% most deprived | All | All | All | All | All |
| 24018 | 2015 | 15.0 | An 11 minute walk or more | All | All | All | 20% most deprived | All | All | All | All | All |
| 15569 | 2019 | 15.0 | An 11 minute walk or more | All | All | Rural | All | All | All | All | All | All |
Once community and neighbourhood ratings are held constant, we can see that the top characteristics for people who live far from green spaces are: non-white ethnicity, pensioners, lowest SIMD quintile, social rented or other tenure type.
greenspace_plotter(survey, 'ethnicity', 'long')
greenspace_plotter(survey, 'household_type', 'long')
As pointed out above, the difference in average values between groups of the same variable (e.g. rented versus owned households) may be more informative than the absolute highest value. To find which variables show the largest differences between group means, let us calculate the standard deviation of the group means of each variable for people who live an 11 minute walk or further from a green space. We begin with type_of_tenure:
# This calculates the average for each group
view = (survey.query("nearest_green_space == 'An 11 minute walk or more' & type_of_tenure != 'All'")
.groupby(['type_of_tenure'])
.mean(numeric_only=True)
.drop('year', axis=1)
.sort_values('percent_adults')
)
view
| type_of_tenure | percent_adults |
|---|---|
| Owned Mortgage/Loan | 10.714286 |
| Private Rented | 13.000000 |
| Owned Outright | 13.428571 |
| Social Rented | 13.714286 |
| Other | 16.571429 |
# The standard deviation is:
view.std()
percent_adults    2.091284
dtype: float64
Now calculate all the standard deviations for possible predictors to identify groups of interest. Note that the custom function just does the above for all predictors.
df_std(survey, 'nearest_green_space', 'An 11 minute walk or more')
| | year | percent_adults | nearest_green_space | age | gender | urban_rural_classification | simd_quintiles | type_of_tenure | household_type | ethnicity | community_belonging | neighbourhood_rating |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| percent_adults | 0.843297 | NaN | NaN | 3.94252 | 1.313198 | 0.303046 | 2.424366 | 2.091284 | 2.755329 | 3.939595 | 16.640635 | 24.446819 |
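The source of df_std is not shown, so the following is only a sketch of the idea described above: for each predictor, average the percentages per group and take the standard deviation of those averages. The function `std_of_group_means` and the toy data are hypothetical.

```python
import pandas as pd

def std_of_group_means(df, target_col, target_value, predictors,
                       value_col="percent_adults"):
    """For each predictor, average value_col per group, then take the std of those averages."""
    subset = df[df[target_col] == target_value]
    spreads = {}
    for col in predictors:
        # Drop the 'All' rows so only the actual groups of this predictor remain
        groups = subset[subset[col] != "All"]
        spreads[col] = groups.groupby(col)[value_col].mean().std()
    return pd.Series(spreads, name=value_col)

# Toy data (invented): tenure groups differ, gender groups do not.
toy = pd.DataFrame({
    "nearest_green_space": ["An 11 minute walk or more"] * 8,
    "type_of_tenure": ["Owned", "Owned", "Rented", "Rented", "All", "All", "All", "All"],
    "gender": ["All", "All", "All", "All", "Male", "Male", "Female", "Female"],
    "percent_adults": [10.0, 12.0, 14.0, 16.0, 12.0, 12.0, 12.0, 12.0],
})

spreads = std_of_group_means(
    toy, "nearest_green_space", "An 11 minute walk or more",
    ["type_of_tenure", "gender"],
)
print(spreads)
```

A larger value means the group averages of that predictor are further apart. Note that pandas uses the sample standard deviation (ddof=1) by default.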
Out of all the demographic categories, age, ethnicity, household type, SIMD quintiles and type of tenure seem to explain most of the variation.
greenspace_plotter(survey, 'age', 'long')
The age plot generally matches what we would expect: people aged 65 and over are the group most likely to be further away from green spaces, as are pensioners. Given that this data is self-reported, it may be that people lose easy access to green spaces as they get older because the walk takes them longer, rather than because their homes are actually further away.
Let us now plot neighbourhood rating and community belonging to see whether there is a relationship between them and distance to a green space:
greenspace_plotter(survey, 'neighbourhood_rating', 'long')
greenspace_plotter(survey, 'community_belonging', 'long')
Neighbourhood and community ratings seem to remain constant across the years. Among adults who live far from green spaces, most consistently report feeling very or fairly positive about their neighbourhoods and communities. Let us make use of another helper function, bar_plotter, to plot these variables relative to each other. The plot shows the average percentage of adults across the years 2013-2019, for all of Scotland:
bar_plotter(survey, 'community_belonging', 'percent_adults', 'nearest_green_space')
We can see that, among those who feel very strongly about their communities, the difference between living more than and less than a 10 minute walk away is 2%, and among those who feel fairly strongly it is 0.4%. Similarly, the difference between walking distances is small for those who feel negatively about their communities.
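Differences like these can also be read off a table instead of a plot. The sketch below averages percentages across years with pivot_table and subtracts one distance bin from the other. The data, the bin labels and the column layout are invented for illustration and do not reproduce the survey figures.

```python
import pandas as pd

# Toy data (invented): two years, two belonging groups, two distance bins.
toy = pd.DataFrame({
    "year": [2018, 2019] * 4,
    "community_belonging": ["Very strongly"] * 4 + ["Fairly strongly"] * 4,
    "nearest_green_space": (["Less than 10 minutes"] * 2 + ["More than 10 minutes"] * 2) * 2,
    "percent_adults": [30.0, 32.0, 28.0, 30.0, 40.0, 42.0, 41.0, 41.0],
})

# Average across years for each belonging group and distance bin
averages = toy.pivot_table(
    index="community_belonging",
    columns="nearest_green_space",
    values="percent_adults",
    aggfunc="mean",
)

# Gap between the two distance bins within each belonging group
gap = averages["Less than 10 minutes"] - averages["More than 10 minutes"]
print(averages)
print(gap)
```

Because the inputs are percentages, the gap is a difference in percentage points.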
bar_plotter(survey, 'neighbourhood_rating', 'percent_adults', 'nearest_green_space')
Overall, there seems to be little relationship between how people rate their communities and neighbourhoods and how far they live from green spaces.
bar_plotter(survey, 'age', 'percent_adults', 'nearest_green_space')
In this plot, age group shows more of a relationship with walking distance to green spaces. Among people aged 65 or over, 8% fewer live within a 5 minute walk, and 7% more live further than a 10 minute walk.
bar_plotter(survey, 'urban_rural_classification', 'percent_adults', 'nearest_green_space')
As a final example, we can visualise here the difference between rural and urban households. The difference is 12% for 5 min walking distance.
This analysis focuses on access to green spaces in Scotland. Data comes from the Scottish Household Survey, 2013-2019, and is self-reported.
In terms of the demographics of people who live close to green spaces (defined as being within a 5 minute walk): we conclude that rural households, followed by households with a mortgage, and households with children, score the highest percentages. Variation across the years is generally minor, suggesting that access is not increasing.
In terms of people who live far from green spaces (more than a 10 minute walk): households of non-white ethnicity, pensioner or aged 65+ households, and households in the lowest SIMD quintile generally score the highest percentages.
Community belonging and neighbourhood rating show little relationship with green space access, as people seem to consistently report feeling fairly or very positive regardless of their access. The other report in this project focuses on building a model capable of better explaining and predicting these ratings.